    Statistical methods for the analysis of RNA sequencing data

    RNA sequencing (RNA-seq), a next-generation sequencing technology, is becoming increasingly popular relative to traditional microarrays in transcriptome analyses. The statistical methods used for gene expression analysis differ between the two technologies because array-based technology measures intensities, which are modelled with continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using read counts. Reliable statistical methods are needed to exploit the information from rapidly evolving sequencing technologies, and limited work has been done on expression analysis of time-course RNA-seq data. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to describe the over-dispersed time-course gene count data. We also modify existing common initialization procedures to suit our model-based clustering algorithm. The effectiveness of the proposed methods is assessed using simulated data and illustrated with real data from time-course genomic experiments. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments for missing values in genomic datasets have been developed, but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method that is specifically suited to handling missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach.
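
For orientation, a negative binomial mixture for time-course counts is often written along the following lines. This is a generic sketch and not necessarily the dissertation's parametrization: the symbols $\pi_k$, $\mu_{kt}$ and $\phi_k$, and the assumption that counts are conditionally independent across time points given the cluster, are introduced here for illustration only.

```latex
% Assumed notation: y_{gt} is the read count of gene g at time t = 1, ..., T;
% K clusters with mixing proportions \pi_k, cluster means \mu_{kt},
% and dispersions \phi_k > 0.
f(\mathbf{y}_g) = \sum_{k=1}^{K} \pi_k \prod_{t=1}^{T}
    \mathrm{NB}\!\left(y_{gt} \mid \mu_{kt}, \phi_k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1,

\mathrm{NB}(y \mid \mu, \phi) =
    \frac{\Gamma\!\left(y + \phi^{-1}\right)}{\Gamma\!\left(\phi^{-1}\right)\, y!}
    \left(\frac{1}{1 + \phi\mu}\right)^{\phi^{-1}}
    \left(\frac{\phi\mu}{1 + \phi\mu}\right)^{y},
\qquad \operatorname{Var}(y) = \mu + \phi\mu^{2}.
```

Under this parametrization the variance exceeds the mean whenever $\phi > 0$, which is what "over-dispersed" refers to; as $\phi \to 0$ the distribution approaches the Poisson.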

    Application of the EM Algorithm for Mixture Models

    A developmental trajectory describes the course of behaviour over time. Identifying multiple trajectories within an overall developmental process permits a focus on subgroups of particular interest. This research introduces a SAS macro program that identifies trajectories by using the Expectation-Maximization (EM) algorithm to fit semi-parametric mixtures of logistic distributions to longitudinal binary data. For performance comparison, we consider full maximization algorithms (e.g. SAS procedure PROC TRAJ) and standard EM, as well as two other EM-based algorithms for speeding up convergence. The simulation study shows that our EM methods produce more accurate parameter estimates than the full maximization methods. The EM-based methodology is illustrated with a longitudinal data set involving adolescents' smoking behaviours.
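
To make the idea of EM for a mixture fitted to longitudinal binary data concrete, here is a minimal Python sketch. It is not the SAS macro described in the abstract and is deliberately simplified: each group gets a free response probability at every time point rather than a logistic curve in time, and the function name `em_bernoulli_mixture` and all of its parameters are hypothetical.

```python
import math
import random


def em_bernoulli_mixture(data, n_groups, n_iter=200, seed=0, eps=1e-6):
    """Fit a mixture of Bernoulli trajectories with standard EM.

    Simplifying assumption (not from the source): each group k has its own
    probability probs[k][t] at every time point t, and responses are
    independent across time given the group.

    data: list of equal-length 0/1 sequences, one per subject.
    Returns (weights, probs, resp) where resp[i][k] is the posterior
    probability that subject i belongs to group k.
    """
    rng = random.Random(seed)
    n, t_len = len(data), len(data[0])
    weights = [1.0 / n_groups] * n_groups
    # Random starting values break the symmetry between groups.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(t_len)]
             for _ in range(n_groups)]
    resp = []

    for _ in range(n_iter):
        # E-step: posterior group memberships, computed on the log scale.
        resp = []
        for y in data:
            log_post = []
            for k in range(n_groups):
                ll = math.log(weights[k])
                for t in range(t_len):
                    p = probs[k][t]
                    ll += math.log(p if y[t] else 1.0 - p)
                log_post.append(ll)
            m = max(log_post)
            unnorm = [math.exp(v - m) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: weighted proportions, clamped away from 0 and 1.
        sizes = [max(sum(r[k] for r in resp), eps) for k in range(n_groups)]
        size_total = sum(sizes)
        weights = [s / size_total for s in sizes]
        for k in range(n_groups):
            for t in range(t_len):
                num = sum(r[k] * y[t] for r, y in zip(resp, data))
                probs[k][t] = min(max(num / sizes[k], eps), 1.0 - eps)

    return weights, probs, resp
```

Each subject can then be assigned to the group with the largest entry in its row of `resp`. Replacing the free per-time probabilities with a logistic function of time would bring the sketch closer to a trajectory model, at the cost of an iterative M-step.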